Method for annotating points on a hand image to create training dataset for machine learning

ABSTRACT

A method for annotating points on a 2D image of a hand includes capturing several images of the hand from different views; for each viewpoint, the hand is imaged using cameras including: one first 2D camera and one 3D camera; using a 3D engine for: creating a 3D hand representation from the 3D camera; considering a 3D model of an articulated hand with predefined annotation points; considering several 3D viewpoints of the hand; for each viewpoint considered: modifying the articulated hand to be superimposed with the 3D representation; considering a 2D image captured from the first 2D camera; superimposing the modified articulated hand on the hand captured on the 2D image; applying the annotation points of the modified articulated hand on the hand captured on the 2D image; and storing the 2D image with annotation on the hand.

BACKGROUND

The present invention relates to a method for annotating points on 2D images of a hand. Annotation points are key points placed on strategic areas of the hand. The key points allow the hand to be completely recognized by a computer in augmented reality or virtual reality.

In computer vision, hand tracking involves extracting the pose of the hand for each image included in a video, or in a sequence of images. A hand pose can generally be expressed as a collection of the locations of the joints of the fingers and the palm. Hand tracking is gaining momentum in academia and industry due to its broad applications, such as human-machine interaction, robotic design, avatar animation, gesture comprehension and augmented reality. Although a lot of effort has been put into this area, hand tracking remains very difficult due to these issues:

-   large dimensional configuration space of the hand pose;
-   homogeneous distribution of the colors of the skin of the hand;
-   frequent self-occlusion and occlusion by other objects;
-   quick hand movement.

In order to improve the hand tracking system, it is imperative to effectively detect a hand. Given the diversity of hands, machine learning technology seems to be a good candidate for effective hand identification. However, designing a machine learning engine requires a lot of training data.

The performance of Machine Learning (ML), whether supervised or unsupervised, depends primarily on the training dataset used. This training dataset must be made up of a large population of annotated samples representing the various possible contexts of use and the expected results in these situations. For generalization to be possible, that is, for predictions to be correct on data not present in the training samples, the training dataset must be fairly substantial.

In the context of gesture recognition, for example, learning samples are mostly 2D images in which a hand adopts a particular pose. Depending on the pose, a gesture is recognized and an action is triggered by software. Such software is particularly in demand in the context of AR/VR (Augmented Reality/Virtual Reality) headsets. The difficulty added by this hardware is that the rendering expected by users is in 3D, whereas the capture cameras are generally 2D cameras. Indeed, AR/VR equipment has a 3D display screen for the user. It is therefore imperative to deduce the position of the hand in 3D space from 2D cameras, which are most often monochrome. Training datasets of ten thousand samples minimum are available on the internet. In theory, the samples should be varied: hands of different ethnicities and sizes, different environments and changing light conditions. If these variations are not respected, the result of the learning, called a "model", may be over-trained.

The creation of substantial training datasets is necessary for each new 2D camera used in AR/VR equipment and each new point of view (front or back camera of a smartphone, above the console of a car, in an AR/VR headset, etc.). This creation of samples is usually manual and can be particularly time consuming. Manual annotation of, say, twenty-two points for a 3D image can take up to four minutes with a simple 3D engine. It involves placing, for example, twenty-two points on predefined joints and areas of the hand. Another disadvantage of the manual method is that, without physical constraints or length benchmarks, the Z coordinate (depth) is annotated differently depending on the annotator. For the same hand, the fingers can vary from five to twenty centimeters apart depending on the annotator. In cases where part of the hand is hidden, the position of the fingers can be difficult to deduce.

Document US20200151884A1 discloses an automated annotated object tracking tool that allows machine-learning teams to annotate an object within a frame and have that annotation persist across frames as the annotated object is tracked within a series of frames.

Document WO2017031385A1 discloses a user device within a communication architecture, the user device comprising an asynchronous session viewer configured to: receive asynchronous session data, the asynchronous session data comprising at least one image, camera pose data associated with the at least one image, and surface reconstruction data associated with the camera pose data; select a field of view position; and edit the asynchronous session data by adding/amending/deleting at least one annotation object based on the selected field of view.

Document US20200090409A1 discloses a method comprising the steps of: capturing a first image at a location and location data associated with the first image; placing an augmented reality (AR) object within the first image with reference to the location data associated with the first image; storing the AR object and location data associated with the first image; and placing the AR object into a second image when location data associated with the second image corresponds to the location data associated with the first image.

The object of the present invention is to propose a dynamic solution for the recognition of gestures. Another goal of the invention is a new, fast and efficient method for creating training data for a machine learning engine. Another object of the invention is a new method for creating rich data to improve learning when machine learning is used.

SUMMARY

At least one of the above-mentioned objects is achieved with a method for annotating points on a 2D image of a hand, the method comprising:

-   capturing several images of the hand from different points of view;
-   for each point of view, the hand is imaged using an assembly of cameras comprising at least: one first 2D camera and one 3D camera;
-   using a 3D engine for:
    -   creating a 3D representation of the hand from the 3D camera;
    -   considering a 3D model of an articulated hand wherein annotation points are predefined;
    -   considering several points of view of the hand in the 3D representation;
    -   for each point of view considered, realizing the following steps:
        -   modifying the articulated hand so as to be superimposed with the 3D representation;
        -   considering a 2D image captured from the first 2D camera, preferably captured at the same time as the images used to create the 3D representation at the pending point of view;
        -   superimposing the modified articulated hand on the hand captured on the 2D image;
        -   applying the annotation points of the modified articulated hand on the hand captured on the 2D image;
-   storing the 2D image with annotation on the hand.
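
Purely for illustration, this pipeline can be sketched in Python-style pseudocode as follows; every helper name here (build_sphere_representation, fit_hand, project) is a hypothetical placeholder, not part of the claimed method:

    # Hedged sketch of the annotation pipeline; all helper names are invented.
    def annotate_sequence(frames_2d, frames_3d, articulated_hand, transfer_matrix):
        dataset = []
        # Each pair of frames was captured at the same instant, i.e. the same point of view.
        for img_2d, img_3d in zip(frames_2d, frames_3d):
            cloud = build_sphere_representation(img_3d)      # 3D representation of the hand
            pose = fit_hand(articulated_hand, cloud)         # superimpose the articulated hand
            points_2d = project(pose.annotation_points, transfer_matrix)  # onto the 2D image
            dataset.append((img_2d, points_2d))              # store the annotated 2D image
        return dataset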

The present invention uses the benefits of 3D images to place annotation points on 2D images. Another advantage of the present invention is a high precision in the placement of annotation points on the hand. The method of the invention is faster than the manual method according to the prior art.

As the method according to the invention uses real 3D images, the texture rendering is perfect, which greatly improves training when machine learning is used. The computer steps of the present invention can be carried out by a processing unit. With the method according to the invention, each 2D image can be annotated. Advantageously, the 2D images can constitute a training dataset that can be used for training an anomaly detection model in a machine learning system. Capturing images from different points of view makes a large amount of images available under various circumstances. The cameras are synchronized so that it is possible to retrieve the images obtained from all cameras at the same given time.

The invention makes it possible to create a training dataset for any kind of 2D camera. The training dataset can be used to create a machine learning model for a virtual reality or an augmented reality system based on the first 2D camera.

The camera assembly is arranged so that the first 2D camera is removable. Each new camera for which a training dataset has to be determined takes the place of the first 2D camera according to the invention. The annotation phase according to the invention is much faster and more precise than the manual annotation of each key point. Since it does not depend on an iteration algorithm or a machine learning model, it takes place in real time. The estimated time for annotating the first frame can be, for example, a few minutes, and then a few seconds for each subsequent frame in the sequence. For a sequence of one minute at 30 FPS, for a total of 1,800 images, the annotation of the latter would last about 2 hours and 30 minutes. This estimate makes more sense when compared to that of a conventional 22-point manual annotation, which would amount to 150 hours for the same volume of frames. The articulated hand is a synthetic hand that is predetermined by computer and comprises annotation points in the right places.

According to the invention, the first 2D camera can be a monochrome, an RGB or a thermographic camera. The monochrome camera can be a black and white camera.

According to an embodiment of the invention, during the step of capturing, the hand can keep the same gesture, and the assembly of cameras turns relatively around the hand. A first case can be the situation wherein the assembly turns around a fixed hand. Another case can be the situation wherein the assembly is fixed and the hand pivots on itself while keeping the same gesture. It is therefore possible to obtain several thousand images for a capturing phase of a few minutes. Those configurations allow for capturing the hand from different points of view. Preferably, the hand keeps the same gesture so as to accelerate the step of modifying the articulated hand to be superimposed with the 3D representation. Indeed, after modification of the articulated hand for the first images, subsequent images will only need a rotation to superimpose the articulated hand with the 3D representation.

According to an advantageous embodiment of the invention, the 3D representation of the hand can be a model based on a sphere arrangement. This is a simple sphere model using 48 spheres with flexible locations and empirically defined fixed radii. The model based on spheres can be named "hand landmark model" or "hand keypoints detection". The hand is no longer considered as a single unit, but instead is split up as a model of N keypoints of interest, representative of the different articulations. Knowing the position of each finger joint should provide enough information to detect which pose the hand is in or which gesture the hand is doing.

Each of the five fingers has four degrees of freedom (DoF): two for the base of the finger, to represent flexion and abduction of the finger, and two for flexion of the remaining joints. The whole hand has 6 global DoF: 3 DoF for global location and the remaining 3 DoF for global hand orientation. In order to avoid an implausible articulated hand pose, kinematic constraints are added to the hand model to keep the joint angles within a valid range.
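
As a rough illustration of such kinematic constraints, the following sketch clamps the four DoF of one finger to a plausible range; the angle limits are invented for the example and are not specified by the invention:

    # Illustrative per-DoF limits in degrees (invented values):
    # base flexion, base abduction, and flexion of the two remaining joints.
    FINGER_LIMITS = [(-20, 90), (-15, 15), (0, 110), (0, 90)]

    def clamp_finger_angles(angles_deg):
        """Keep the four DoF of one finger within a valid range."""
        return [min(max(a, lo), hi) for a, (lo, hi) in zip(angles_deg, FINGER_LIMITS)]

    # An implausible hyperextension is pulled back into range:
    print(clamp_finger_angles([-40, 5, 130, 20]))  # -> [-20, 5, 110, 20]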

Advantageously, the assembly of cameras can comprise cameras fixed together, immobile with respect to one another, with the fields of view of the cameras overlapping in a common area. Preferably, the common area is the field of view of the depth camera. Indeed, the depth camera provides the 3D position of each pixel. Some 3D positions are used as landmarks in the 3D graphical application.

According to the invention, a calibration step can be carried out to determine transfer matrices between the cameras of the assembly. Said matrices can be used for identifying the same part of the hand in different images of the same point of view. The matrices are also used to project the articulated hand in the images of the different cameras. In particular, when the fields of view of the cameras are different, cameras of the type having a smaller field of view can be added so as to cover the field of view of the camera having the largest field of view.
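
A minimal sketch of how such transfer matrices could be derived and applied, assuming calibration yields each camera's pose as a 4x4 homogeneous world-from-camera matrix (the example poses below are invented):

    import numpy as np

    def transfer_matrix(T_world_from_a, T_world_from_b):
        """4x4 matrix mapping homogeneous points from camera A's frame to camera B's frame."""
        return np.linalg.inv(T_world_from_b) @ T_world_from_a

    T_depth = np.eye(4)                       # depth camera at the world origin (example)
    T_mono = np.eye(4)
    T_mono[0, 3] = 0.06                       # 2D camera 6 cm to the side (example)

    p_depth = np.array([0.02, -0.05, 0.40, 1.0])          # a hand point seen by the depth camera
    p_mono = transfer_matrix(T_depth, T_mono) @ p_depth   # the same point in the 2D camera frame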

According to an advantageous embodiment of the invention, the articulated hand has the dimensions of the captured hand.

Advantageously, the 3D model of the articulated hand can be based on inverse kinematics. This inverse kinematics method allows a free modification of the pose of the articulated hand and makes it possible to solve the problems linked to respecting the size of the elements and the overall consistency of the hand. For example, during the step of capturing, each camera captures images at the same speed, for example 30, 45 or 60 frames per second. For example, during the step of capturing, images are stored in a DAT format. The DAT format images can be converted to PNG format during a step of sorting images. The DAT format allows a backup of all the data obtained for all the images captured.
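
The description does not specify the internal layout of the DAT files; purely as an assumption for illustration, the sketch below treats each file as a raw, headerless dump of 16-bit pixels of known width and height, and converts it to PNG with Pillow:

    import numpy as np
    from PIL import Image

    def dat_to_png(dat_path, png_path, width, height):
        # Assumption: raw little-endian 16-bit pixels, row-major, no header.
        raw = np.fromfile(dat_path, dtype=np.uint16).reshape(height, width)
        Image.fromarray(raw, mode="I;16").save(png_path)  # PNG keeps the full 16-bit range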

According to an advantageous embodiment of the present invention, the 3D camera can comprise a second 2D camera, for example of RGB type, and a depth camera; the 3D representation being created by combining data coming from images captured at the same time from the second 2D camera and the depth camera.

According to an embodiment of the invention, the creation step of the 3D representation comprises a step of transforming the images from the second 2D camera and/or the depth images to have the same image resolution. When the images from the second 2D camera are of RGB type, the depth images can for example be converted into the RGB coordinate system. To associate each RGB pixel with a depth value, transfer matrices can be used to switch from one image to another.
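
As a simplified sketch, assuming the depth image and the RGB image are already registered and only differ in resolution, the resampling step could look as follows (real registration would additionally use the transfer matrices discussed above):

    import cv2  # OpenCV, used here only for resampling

    def match_resolution(depth, rgb):
        """Resample the depth map to the RGB image's width and height."""
        h, w = rgb.shape[:2]
        # Nearest-neighbour avoids blending depth values across object boundaries.
        return cv2.resize(depth, (w, h), interpolation=cv2.INTER_NEAREST)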

According to the invention, the cameras can be configured to operate with six degrees of freedom (6DOF) and the data of the cameras can be processed as six degrees of freedom in the 3D engine.

According to an advantageous embodiment of the invention, when annotation points are determined for a first point of view, for the subsequent points of view, the step of modifying the articulated hand is automatically made based on six degrees of freedom data. When the user is satisfied with the pose, 22 annotation points can be stored for the four images. The next frame of the recording is then displayed. The user is not starting from scratch: as the subject of the video has not changed pose and it is just the camera that has moved, the articulated hand pose is fixed for all frames in the recording. The articulated hand only has to be translated or tilted for the t+N frames. If the 3D camera additionally has a 6 DOF system, this tracking annotation can be automated. The 3D engine can take the 6 DOF data recorded for every frame and apply it to a virtual camera. As the cloud of spheres is attached to this virtual camera, the movement during recording is applied to this camera and the cloud of spheres should automatically position itself in the articulated hand.
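
A sketch of this automation, under the assumption that the 6 DOF system delivers one 4x4 world-from-camera pose matrix per frame: the hand pose fitted on the first frame stays fixed in world coordinates and is simply re-expressed in every frame's camera frame:

    import numpy as np

    def reproject_fixed_hand(points_world, poses_world_from_cam):
        """points_world: (N, 4) homogeneous annotation points, fixed across the recording.
        poses_world_from_cam: one 4x4 camera pose per frame, from the 6 DOF data."""
        per_frame = []
        for T_world_from_cam in poses_world_from_cam:
            T_cam_from_world = np.linalg.inv(T_world_from_cam)
            per_frame.append((T_cam_from_world @ points_world.T).T)  # points in camera frame
        return per_frame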

According to another aspect of the invention, there is proposed a computer program product, comprising: a computer-readable storage device; and a computer-readable program code stored in the computer-readable storage device, the computer-readable program code containing instructions that are executed by a central processing unit (CPU) of a computer system to implement a method of:

-   creating a 3D representation of a hand from a 3D image comprising the hand;
-   considering a 3D model of an articulated hand wherein annotation points are predefined;
-   modifying the articulated hand so as to be superimposed with the 3D representation;
-   considering a 2D image of the hand captured at the same time and according to the same point of view as the 3D image;
-   superimposing the modified articulated hand on the hand captured on the 2D image;
-   applying the annotation points of the modified articulated hand on the hand captured on the 2D image;
-   storing the 2D image with annotation on the hand.

BRIEF DESCRIPTION OF THE DRAWINGS

For the purpose of illustrating the invention, there is shown in the drawings a form that is presently preferred; it being understood, however, that this invention is not limited to the precise arrangements and instrumentalities shown.

FIG. 1 is a general view of a hand wherein 22 annotation points are represented;

FIG. 2 is a general view of a neural network;

FIG. 3 is a general view of a virtual reality headset comprising a 2D camera and an engine according to the invention;

FIG. 4 is a flowchart illustrating a method according to the invention for annotating hands on images acquired by a 2D camera according to the invention;

FIG. 5 is a general view of an assembly of cameras comprising at least one 2D camera and one 3D camera; and

FIG. 6 is a general view of a 3D representation of a hand built from the 3D camera, and of an articulated synthetic hand with predefined annotation points.

DETAILED DESCRIPTION

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the scope of the present invention as defined by the appended claims.

Hereinafter, the present invention will be described in detail by explaining exemplary embodiments of the invention with reference to the attached drawings.

Although the invention is not limited to it, the following description refers to a method for representing annotation points on a hand displayed on images. The images are acquired by using an assembly of cameras comprising a monochrome 2D camera and an RGB camera associated with a depth camera. Advantageously, the monochrome 2D camera is a camera used in a virtual reality headset, in an augmented reality portable device, or in any portable computing device which comprises software and hardware components suitable to implement virtual reality and/or augmented reality.

One object of the invention is to build several thousands of annotated 2D images of hands in order to feed a machine learning model as training data, said model being intended to be integrated in the virtual reality and/or augmented reality device.

FIG. 1 illustrates a hand wherein 22 annotation points are placed on joints and other strategic areas of the hand. More or fewer annotation points can be used, in the same or different places on the hand. Each finger comprises several points. Some points are arranged outside of the fingers, in areas allowing an effective identification of the whole hand. The aim of the present invention is to quickly create several 2D images of hands in different environments (exposure, color . . . ).

FIG. 2 shows a neural network that can be used as a machine learning model to predict a hand. This neural network comprises an input layer fed by two descriptors with the training dataset. A hidden layer is composed of nodes to propagate decisions towards an output layer.

There is a sensitivity layer allowing a measure of the quality of the model according to the proportion of false positives, false negatives, true positives, etc. For example, a true positive can be a hand pixel identified as a hand element and a false positive can be a background pixel labeled as a hand pixel. The higher the sensitivity value, the greater the amount of true positives and the fewer false negatives. Low sensitivity means there are fewer true positives and more false negatives.
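
As a concrete illustration, the sensitivity described here is the fraction of actual hand pixels that are labeled as hand; the pixel counts in the example are invented:

    def sensitivity(true_positives, false_negatives):
        """Fraction of actual hand pixels that were labeled as hand (recall)."""
        return true_positives / (true_positives + false_negatives)

    # Invented counts: 9,000 hand pixels found, 1,000 hand pixels missed.
    print(sensitivity(9000, 1000))  # -> 0.9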

This machine learning model can be used in a virtual reality (VR) headset 1 as illustrated in FIG. 3. The headset is equipped with software and hardware to implement motion tracking. The headset is intended to be worn by a user over the eyes. The user can look around a virtual space as if he were actually there.

The VR headset can provide accurate position tracking as it uses six-degrees-of-freedom (6DoF) motion tracking. It can accurately follow the direction and the position of the user: moving forward, backward, up, down, left, or right.

A band 2 is used to hold the headset on the head.

The front rigid body 3 includes electronic components. Preferably, all components are integrated in the front rigid body 3 in order to optimize compactness, reliability and processing speed.

First and second monochrome cameras 4, 5 are visible from the front side of the headset. Those cameras 4 and 5 are used to capture the environment around the user. An illuminator 6 is arranged on the forehead of the headset 1, for example at equal distances from the two cameras 4 and 5. The cameras 4 and 5, for example infrared cameras, are each a 2D camera according to the invention. According to the invention, said cameras have to be used in an assembly of cameras in order to acquire 2D images based on the method according to the invention. Then, the machine learning model will be designed using training data obtained thanks to said 2D cameras.

A method according to the present invention will now be described based on FIGS. 4-6. FIG. 4 shows a flowchart starting with two image acquisitions 14 in parallel at the same time. To do that, an assembly of cameras as illustrated in FIG. 5 is used. The assembly comprises two 2D cameras 4 and 5. Said cameras are firmly fixed to a 2D RGB camera 7 by means of a fixing device. A depth camera 8 is provided under the camera 7. The cameras 4, 5, 7 and 8 of the assembly are securely attached without relative movement in relation to each other. This facilitates the conversion of coordinates from one camera to another.

Although the assembly in FIG. 5 shows a 2D RGB camera 7 associated with a depth camera 8 or a depth sensor, the combined cameras 7 and 8 can preferably be replaced by a single 3D camera as illustrated in FIG. 4.

For the sake of simplification, the method according to the invention is illustrated with a removable 2D camera, namely a monochrome camera of a VR headset, and a 3D camera assembled together. The removable 2D camera is one of the two cameras 4 and 5, for example camera 4.

Once the two cameras have been assembled together, the procedure is as follows:

-   the subject adopts a pose with his hand 11 and maintains this pose during the recording of a video;
-   the assembly of cameras turns around the subject's hand, varying the possible points of view.

The method of the invention comprises image acquisition steps 14 where all cameras acquire images for different points of view. The acquisition can be done at any speed; the most important thing is that the recording of images from each camera is done at the same time. In other words, a given moment must be materialized by as many images as there are cameras associated with the assembly. In the present case, a moment in the acquisition step includes two images: the image of the 2D camera 4 and the image of the 3D camera. The acquisition is made at 30 FPS, generating, in one second, 30 sample images in 3D RGB and 30 monochrome images.

The removable 2D camera acquires 2D images 20 of the hand.

The 3D camera acquires 3D images 15 of the hand.

The acquired 3D images 15 are then transferred to a 3D engine that builds a 3D representation 16 of the hand 11 for image n, n being a number from 1 to the total number of images acquired that have to be processed. Each 3D image corresponds to a 2D image for the same point of view and the same time of acquisition.

FIG. 6 shows such a 3D representation 10 designed by means of a sphere model using 48 spheres with flexible locations and empirically defined fixed radii. Each of the five fingers has 4 degrees of freedom (DoF): two for the base of the finger, to represent flexion and abduction of the finger, and two for flexion of the remaining joints, while the whole hand has 6 global DoF (3 DoF for global location and the remaining 3 DoF for global hand orientation). In order to avoid an implausible articulated hand pose, kinematic constraints are usually added to the hand model to keep the joint angles within a valid range.

FIG. 6 also shows a 3D model of an articulated hand 12 wherein annotation points 13 are predefined. This hand 12 is a synthetic hand corresponding to the proportions of the recorded subject's hand 11. The 3D engine is for example designed to be usable with a computer mouse or another input device. The synthetic hand 12 is designed with inverse kinematics (IK) and is preferably modeled from the subject's hand: the size of the palm and the lengths of the fingers are as close as possible to reality. Respect for physiognomy is less critical since the inverse kinematics is designed as a safeguard and not as an extremely faithful model of the subject's flexibility.

At step 18 in FIG. 4, for each point of view, that is, at a given moment, the objective is to superimpose the articulated synthetic hand 12 on the 3D representation 16. This superimposition can be made automatically or by a user.

The user is free to modify the pose of the synthetic hand. The inverse kinematics system is present to help solve problems related to respecting the size of the elements and the overall consistency of the hand. The primary objective of the user is to position the synthetic palm of the hand in the cloud of spheres. Drag & drop allows the user to translate the synthetic hand 12. For rotation, axes are drawn around the 3D representation; the user can grab them and tilt them until they see the desired orientation.

Once the palm of the synthetic hand is positioned, the user can adjust the fingers. In this way, the user does not position each phalanx of the synthetic hand; the inverse kinematics takes care of positioning the other key points of the finger according to the location of the target and the constraints of the synthetic palm. Predefined poses can be available: by choosing the pose of a gesture, the user obtains a synthetic hand adopting the pose of the selected gesture. This allows the user to get closer to the actual pose more quickly; all that remains is to slightly adjust the finger targets.
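
To illustrate how inverse kinematics can place the intermediate joints once a fingertip target is chosen, here is a minimal FABRIK-style solver for a single finger chain; the joint positions, segment lengths and target are invented, and the actual 3D engine used by the invention is not specified:

    import numpy as np

    def fabrik(joints, lengths, target, iterations=10):
        """joints: (N, 3) initial joint positions; lengths: (N-1,) segment lengths."""
        joints = joints.astype(float).copy()
        root = joints[0].copy()
        for _ in range(iterations):
            joints[-1] = target                  # backward pass: snap the tip to the target
            for i in range(len(joints) - 2, -1, -1):
                d = joints[i] - joints[i + 1]
                joints[i] = joints[i + 1] + d / np.linalg.norm(d) * lengths[i]
            joints[0] = root                     # forward pass: re-anchor the finger base
            for i in range(1, len(joints)):
                d = joints[i] - joints[i - 1]
                joints[i] = joints[i - 1] + d / np.linalg.norm(d) * lengths[i - 1]
        return joints

    # Example: a three-segment finger (lengths 4, 3, 2) reaching for a nearby target.
    joints = np.array([[0, 0, 0], [0, 0, 4], [0, 0, 7], [0, 0, 9]])
    print(fabrik(joints, np.array([4.0, 3.0, 2.0]), np.array([3.0, 2.0, 6.0])))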

At the end of this manipulation, a well oriented articulated synthetic hand 19 is obtained.

At step 21, the well oriented articulated synthetic hand 19 is projected on the 2D image 20 coming from the 2D camera 4. This 2D image has been acquired at the same time, and thus from the same point of view, as the 3D image used to create the 3D representation 16. The projection uses transfer matrices between the 3D image and the 2D image to correctly position the well oriented articulated synthetic hand 19 on the hand displayed on the 2D image 20. The annotation points of the well oriented articulated synthetic hand 19 are then applied on the 2D image at step 22 to obtain the annotated 2D hand 23.
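
A minimal sketch of this projection under a pinhole camera model, where the transfer matrix moves the annotation points into the 2D camera's frame and an intrinsic matrix K maps them to pixels; K and the transfer matrix are illustrative placeholders obtained from the calibration step:

    import numpy as np

    def project_annotation_points(points_3d_h, T_2d_from_3d, K):
        """points_3d_h: (N, 4) homogeneous annotation points in the 3D camera's frame.
        T_2d_from_3d: 4x4 transfer matrix; K: 3x3 intrinsics of the 2D camera."""
        p_cam = (T_2d_from_3d @ points_3d_h.T)[:3]   # points in the 2D camera frame
        uv = K @ p_cam                               # pinhole projection
        return (uv[:2] / uv[2]).T                    # (N, 2) pixel coordinates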

At step 24, the annotated 2D hand 23 is stored in the training dataset. Then, at step 25, the engine checks if all images are processed. If not, a new 3D representation is created at 16 for a new point of view, and this 3D representation is processed up to the storing in the dataset. For the processing of this new 3D representation, the user is not starting from scratch. Since the subject of the video has not changed pose and it is just the camera that has moved, the synthetic hand pose is fixed for all frames in the recording. The user only has to translate or tilt the synthetic hand for the t+N frames.

The method according to the invention has several advantages. It can be adapted to all available cameras since the 2D camera is replaceable. The annotation phase is much faster and more precise than the manual annotation of each key point. Since it does not depend on an iteration algorithm or an ML model, it takes place in real time. The estimated time for annotating the first image is 2 minutes, and then 5 seconds for each subsequent image in the sequence. For a sequence of 1 minute at 30 FPS, for a total of 1,800 images, the annotation of the latter would last about 2 hours and 30 minutes. This estimate makes more sense when compared to that of a conventional 22-point manual annotation, which would amount to 150 hours for the same volume of images.
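
The quoted figures can be checked with straightforward arithmetic (a small verification only; the 150-hour figure implies roughly 5 minutes per manually annotated image):

    first = 120                  # 2 minutes for the first image, in seconds
    rest = (1800 - 1) * 5        # 5 seconds for each of the 1,799 remaining images
    print((first + rest) / 3600) # ~2.53 hours, i.e. roughly 2 h 30 min
    print(1800 * 5 / 60)         # manual annotation at ~5 min/image: 150.0 hours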

With the method according to the invention, training data comprising hands annotated on 2D images are obtained. The data will be used for the training of a machine learning model that will be used in a device, this device also being provided with a 2D camera used for the acquisition of 2D images during the method of the invention.

Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.

1. A method for annotating points on a 2D image of a hand, the method comprising: capturing several images of the hand from different points of view; for each point of view, the hand is imaged using an assembly of cameras comprising at least: one first 2D camera and one 3D camera; using a 3D engine for: creating a 3D representation of the hand from the 3D camera; considering a 3D model of an articulated hand wherein annotation points are predefined; considering several points of view of the hand in the 3D representation; for each point of view considered, realizing the following steps: modifying the articulated hand so as to be superimposed with the 3D representation; considering a 2D image captured from the first 2D camera; superimposing the modified articulated hand on the hand captured on the 2D image; applying the annotation points of the modified articulated hand on the hand captured on the 2D image; and storing the 2D image with annotation on the hand.

2. The method according to claim 1, wherein during the step of capturing, the hand keeps the same gesture, and the assembly of cameras turns relatively around the hand.

3. The method according to claim 1, wherein the 3D representation of the hand is a model based on a sphere arrangement.

4. The method according to claim 1, wherein the assembly of cameras comprises cameras fixed together immobile with respect to one another, the fields of view of the cameras overlapping in a common area.

5. The method according to claim 4, wherein the common area is the field of view of the depth camera.

6. The method according to claim 1, wherein a calibration step is carried out to determine transfer matrices between the cameras of the assembly.

7. The method according to claim 1, wherein when the fields of view of the cameras are different, cameras of the type having a smaller field of view are added so as to cover the field of view of the camera having the largest field of view.

8. The method according to claim 1, wherein the articulated hand has the dimensions of the captured hand.

9. The method according to claim 1, wherein the 3D model of the articulated hand is based on inverse kinematics.

10. The method according to claim 1, wherein during the step of capturing, each camera captures images at a same speed.

11. The method according to claim 1, wherein during the step of capturing, images are stored in a DAT format.

12. The method according to claim 11, wherein the DAT format images are converted to PNG format during a step of sorting images.

13. The method according to claim 1, wherein the 3D camera comprises a second 2D camera and a depth camera; the 3D representation being created by combining data coming from images captured at the same time from the second 2D camera and the depth camera.

14. The method according to claim 13, wherein the creation step of the 3D representation comprises a step of transforming the images from the second 2D camera and/or the depth images to have the same image resolution.

15. The method according to claim 1, wherein the first 2D camera is a monochrome, an RGB or a thermographic camera.

16. The method according to claim 1, wherein the cameras are configured to operate with six degrees of freedom and the data of the cameras are processed as six degrees of freedom in the 3D engine.

17. The method according to claim 16, wherein when annotation points are determined for a first point of view, for the subsequent points of view, the step of modifying the articulated hand is automatically made based on six degrees of freedom data.

18. A computer program product, comprising: a computer-readable storage device; and a computer-readable program code stored in the computer-readable storage device, the computer-readable program code containing instructions that are executed by a central processing unit (CPU) of a computer system to implement a method of: creating a 3D representation of a hand from a 3D image comprising the hand; considering a 3D model of an articulated hand wherein annotation points are predefined; modifying the articulated hand so as to be superimposed with the 3D representation; considering a 2D image of the hand captured at the same time and according to a same point of view as the 3D image; superimposing the modified articulated hand on the hand captured on the 2D image; applying the annotation points of the modified articulated hand on the hand captured on the 2D image; and storing the 2D image with annotation on the hand.